Spatiotemporal Patterns of Influenza Activity in the United States

BMIN503/EPID600 Final Project

Author

Grace Li

Published

December 13, 2023


Overview

Season influenza viruses continues to be a major global burden. There are around a billion cases annually and up to 650,000 people die of influenza every year. Influenza activity patterns vary geographically and temporally, so we want to examine the spatiotemporal patterns of influenza activity in the United States and determine if variations in influenza activities across states or regions are associated with distribution of circulating virus strains. I will be using surveillance data published by the Centers for Disease Control and Prevention (CDC) and viral sequences from the Global Initiative on Sharing All Influenza Data (GISAID) to answer these questions. Understanding patterns of seasonal influenza activity can help guide early influenza surveillance and inform influenza prevention and control programs. Materials for this project can be found in this GitHub repository.

Introduction

Influenza has a large seasonal burden in the United States: up to 35 million people are infected annually causing between 12,000 and 56,000 deaths. There are four types of influenza viruses, but influenza A and B viruses are responsible for seasonal epidemics in the human population. Influenza A viruses are divided into subtypes based on the two surface glycoproteins – hemagglutinin and neuraminidase. H1N1 and H3N2 are the main influenza A subtypes that cause infection in people. Similarly, influenza B viruses can be divided into lineages. Influenza viruses can be further subdivided into clades and subclades based on their hemagglutinin gene sequences for virus characterization. In the U.S., flu season typically starts on the 40th week of the year (1st week of October), but influenza activity including time of onset, duration, and number of peaks varies greatly across the states.

The CDC routinely publishes weekly reports of influenza surveillance data providing a national picture of influenza virus activities in the U.S. One aspect of these surveillance is performed by clinical laboratories across the country determining the percentage of specimens positive for influenza virus. These test results allow us to evaluate if influenza cases are increasing or decreasing. Efforts by public health laboratories to sequence influenza genomes have also allowed researchers to perform genetic characterization of influenza viruses using public databases like GISAID.

Influenza viruses are constantly changing, allowing the introduction of new clades and extinction of less fit strains. With sequencing data, we can determine influenza clade distribution by examining the number of clades present and the proportion of each clade at a given time. It is unclear if clade distribution pattern is associated with influenza activity such as flu season duration. By linking regional epidemiologic data to circulating influenza sequencing data, we hope to identify patterns of clade distribution associated with influenza activities in the U.S. An understanding of the spatial and temporal spread of influenza can help inform public health decisions in influenza prevention and control.

Methods

Data source

Datasets Time range Source
Influenza surveillance 2010-present FluView
Influenza sequencing data 2010-10-01 to 2019-10-01 GISAID

Influenza surveillance

We obtained influenza surveillance data from the CDC Weekly U.S. Influenza Surveillance Report. Viral surveillance from approximately 300 clinical laboratories in the U.S. World Health Organization (WHO) Collaborating Laboratories System and the National Respiratory and Enteric Virus Surveillance System (NREVSS) is reported weekly and data can be accessed through FluView Interactive. Data collection from both the U.S. WHO Collaborating Laboratories System and NREVSS began during the 1997-98 season. The total number of respiratory specimens tested and %positive rate for influenza are reported weekly.

For this project, I downloaded data from all 50 states starting from the 2010-11 season. Data from clinical laboratories include the weekly total number of specimens tested, the number of positive influenza tests, and the percent positive by influenza type.

GISAID

GISAID, the Global Initiative on Sharing All Influenza Data, is a global database that provides access to genomic data of influenza viruses and other pathogens including coronaviruses and poxvirus. Within the EpiFlu database in GISAID, I searched for human influenza A and influenza B virus isolates collected in the United States from October 2010 to October 2019. I further filtered isolate selection by requiring the isolate to contain complete sequences of both the hemagglutinin and neuraminidase segment and not have been passaged in cells or eggs. I downloaded the virus metadata that contains information including isolate name, subtype, and clade.

Downloading and cleaning data

First, we will download the weekly U.S. Influenza Surveillance Report.

# Load WHO/NREVSS surveillance data from clinical labs
clinical_labs <- read_csv("WHO_NREVSS_Clinical_Labs.csv")

Next, I will clean the surveillance data by removing unused variables and renaming variables. Influenza season typically starts on week 40 (first week of October), so I will include a variable named “ordered_week” that arrange weeks starting with week 40.

# Rename variables and order week variable
clinical_labs <- clinical_labs |>
        dplyr::select(-c("REGION TYPE")) |>
        dplyr::rename(state = REGION, 
               year = YEAR,
               week = WEEK,
               total_specimens = `TOTAL SPECIMENS`, 
               total_A = `TOTAL A`,
               total_B = `TOTAL B`,
               percent_positive = `PERCENT POSITIVE`,
               percent_A = `PERCENT A`,
               percent_B = `PERCENT B`) |> 
        mutate(ordered_week = (week - 40) %% 52 + 1) |>
        mutate(season = ifelse(week >= 40, year, year - 1)) |>
        mutate(ordered_week = as.factor(ordered_week)) |>
        mutate(percent_positive = as.numeric(percent_positive))

Since I will be drawing maps, I will retrieve shapefile for all states using tigris() and obtain information on states using the R Datasets Package.

# Retrieve shapefile for all states using tigirs() 
states <- states()

# Obtain state information on names and division using the package datasets
states_division <- data.frame(NAME = datasets::state.name, 
                              division = datasets::state.division)

# Merge shapefile with state information 
states <- inner_join(states, states_division, by = "NAME") |>
            rename(state = NAME)

Now I will merge the shapefile data frame with surveillance data.

# Include state division info in clinical_labs df  
clinical_labs <- inner_join(states, clinical_labs, by = "state")

Next, I am going to download influenza sequencing data from GISAID.

# Load data for influenza A and influenza B viruses
gisaid_a <- read_csv("gisaid_epiflu_isolates_fluA.csv")
gisaid_b <- read_csv("gisaid_epiflu_isolates_fluB.csv")

# Combine data for influenza A and influenza B viruses
gisaid <- bind_rows(gisaid_a, gisaid_b)

# Select relevant variables and rename variables
gisaid <- gisaid |>
            dplyr::select(c("Isolate_Id", "Isolate_Name", 
                            "Subtype", "Clade", "Collection_Date")) |>
            rename(id = Isolate_Id, 
                   name = Isolate_Name,
                   subtype = Subtype,
                   clade = Clade,
                   collection_date = Collection_Date) 

# Convert 'collection_date' to Date format
gisaid$collection_date <- as.Date(gisaid$collection_date)

# Create a new column 'week' indicating the week of the year
gisaid$week <- week(gisaid$collection_date)

# Create a new column 'season' indicating the flu season from which the isolate was sequenced
gisaid$season <- ifelse(gisaid$week >= 40, year(gisaid$collection_date), 
                        year(gisaid$collection_date) - 1)

# Create a new column 'ordered_week' for graphing and convert 'ordered_week' to numeric type
gisaid <- gisaid |>
            mutate(ordered_week = (week - 40) %% 52 + 1) |>
            mutate(ordered_week = as.numeric(ordered_week))

# Create a new column 'state' to indicate where the virus isolate came from
gisaid <- gisaid |>
  mutate(state = sapply(strsplit(as.character(name), "/"), `[`, 2))

head(gisaid)
# A tibble: 6 × 9
  id         name  subtype clade collection_date  week season ordered_week state
  <chr>      <chr> <chr>   <chr> <date>          <dbl>  <dbl>        <dbl> <chr>
1 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
2 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-11-23         47   2010            8 Penn…
3 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
4 EPI_ISL_9… A/Fl… A / H3… 3C.2  2010-12-09         49   2010           10 Flor…
5 EPI_ISL_9… A/No… A / H3… 3C.2  2011-02-16          7   2010           20 Nort…
6 EPI_ISL_9… A/Te… A / H3… 3C.2  2011-02-09          6   2010           19 Texas

Next we will clean up the GISAID dataset.

# Remove rows where 'subtype' is "A", "A / H1N2", or "A / H3" (incomplete sequencing)
gisaid <- gisaid |> 
            filter(!(subtype %in% c("A", "A / H1N2", "A / H3")))

# Remove rows where 'clade' or 'collection_date' is NA or if clade is unassigned
gisaid <- gisaid |>
            filter(!is.na(clade) & !is.na(collection_date)) |>
            filter(!(clade %in% c("unassigned")))

# Define a vector containing the names of all 50 states
states_list <- c(
  "Alabama", "Alaska", "Arizona", "Arkansas", "California", "Colorado",
  "Connecticut", "Delaware", "Florida", "Georgia", "Hawaii", "Idaho",
  "Illinois", "Indiana", "Iowa", "Kansas", "Kentucky", "Louisiana", "Maine",
  "Maryland", "Massachusetts", "Michigan", "Minnesota", "Mississippi",
  "Missouri", "Montana", "Nebraska", "Nevada", "New Hampshire", "New Jersey",
  "New Mexico", "New York", "North Carolina", "North Dakota", "Ohio", "Oklahoma",
  "Oregon", "Pennsylvania", "Rhode Island", "South Carolina", "South Dakota",
  "Tennessee", "Texas", "Utah", "Vermont", "Virginia", "Washington", "West Virginia",
  "Wisconsin", "Wyoming"
)

# Keep only rows where the "state" column matches any state in the 50 states (exclude entries like New York City or District of Columbia)
gisaid <- gisaid |>
            filter(state %in% states_list)

head(gisaid)
# A tibble: 6 × 9
  id         name  subtype clade collection_date  week season ordered_week state
  <chr>      <chr> <chr>   <chr> <date>          <dbl>  <dbl>        <dbl> <chr>
1 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
2 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-11-23         47   2010            8 Penn…
3 EPI_ISL_9… A/Pe… A / H3… 3C.2  2010-12-07         49   2010           10 Penn…
4 EPI_ISL_9… A/Fl… A / H3… 3C.2  2010-12-09         49   2010           10 Flor…
5 EPI_ISL_9… A/No… A / H3… 3C.2  2011-02-16          7   2010           20 Nort…
6 EPI_ISL_9… A/Te… A / H3… 3C.2  2011-02-09          6   2010           19 Texas

Results

Exploratory data analysis

CDC surveillance data

I will start with some exploratory data analysis by looking at the CDC surveillance data in the 2017-18 flu season.

# Filter data from the 2017-18 season 
season2017 <- clinical_labs |>
                filter(season == "2017")

# Set up theme for graphing maps 
my_theme <- function() {
  theme_minimal() +                                  
  theme(axis.line = element_blank(),                 
        axis.text = element_blank(),                 
        axis.title = element_blank(),
        panel.grid = element_line(color = "white"),  
        legend.key.size = unit(0.4, "cm"),          
        legend.text = element_text(size = 8),       
        legend.title = element_text(size = 10),
        plot.title = element_text(size = 12))      
}

# Create a color palette function
myPalette <- colorRampPalette(brewer.pal(9, "YlOrRd"))

# Function to create a single map for a given week
create_weekly_map <- function(week_data) {
  
  weekly_map <- week_data |> 
    st_simplify() |>
    ggplot() +
    geom_sf(aes(fill = percent_positive)) + 
    my_theme() +
    labs(title = paste("Week", unique(week_data$week))) +
    scale_fill_gradientn(name = "% Positive", colours = myPalette(100),
                         limit = range(0, 70))
}

# Create a list to store individual plots
season2017_map_list <- list()

# Create plots for each week
unique_week <- unique(season2017$week)

for (ii in unique_week) {
  # Filter data for the specific week and shift geometry
  week_data <- season2017 |> 
    filter(week == ii) |>
    shift_geometry()
  
  # Create a weekly map and add it to the list
  weekly_map <- create_weekly_map(week_data)
  
  season2017_map_list[[length(season2017_map_list) + 1]] <- weekly_map
}
# Create animation using gifski()
for (i in seq_along(season2017_map_list)) {
  print(season2017_map_list[[i]])
}

Overall, we can see that the timing and duration of flu seasons varied in different parts of the country in the 2017-18 season. In some states, flu season began as early as October and November while some states had a later start but continued to have flu circulation until May. We can also see some of the potential issues we might run into using the CDC FluView data. For example, surveillance data is incomplete for some of the states or even completely absent in several states, and this problem is consistent across flu seasons. This is both an artifact of how I selected the surveillance data from FluView and the overall surveillance system in the United States.

Next let’s take a closer look of the CDC surveillance data by state.

# Percent positive cases in 2017-18 season 
season2017 |> 
  ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        facet_wrap(~ state, nrow = 5, ncol = 10) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette = "GnBu") +
        scale_x_discrete(breaks = c(1, 14, 27, 40, 52), 
                         labels = c(40, 1, 14, 27, 39)) +
        scale_y_continuous() +
        labs(x = "Week", y = "Percent Positive", 
             title = "2017-18 Influenza Positive Tests Reported to CDC") +
        theme_classic() +
        theme(legend.position = "none",
              strip.background = element_blank(),
              panel.border = element_rect(color = "black", fill = NA))

It is clear that influenza activity varied greatly across states in the 2017-18 season. States like Nevada had pretty early influenza detection whereas some states did not experience peak flu activity until later. Flu activity remained elevated in the spring in some regions such as the East North Central states. In addition, some states have longer flu season than others and some experienced several waves of influenza activity.

The U.S. Census Bureau has grouped the 50 states and the District of Columbia into nine divisions based on geographic proximity. I will also look at the surveillance data within each division to see if states in close proximity might exhibit similar flu activity patterns.

# Get unique divisions
unique_divisions <- unique(season2017$division)

# Create a list to store individual plots
season2017_plot_list <- list()

# Create plots for each division
for (i in unique_divisions) {
        
  division_data <- season2017 |> 
          filter(division == i)
  
  p <- ggplot(division_data, aes(x = ordered_week, y = percent_positive, fill = state)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ state, ncol = 4, nrow = 2, scales = "free") +
    scale_fill_manual(values = wes_palette("Zissou1", n = 8, type = "continuous")) +
    scale_x_discrete(breaks = c(1, 14, 27, 40, 52), labels = c(40, 1, 14, 27, 39)) +
    scale_y_continuous() +
    ylim(0, 66) +
    labs(x = "Week", y = "Percent Positive", title = i) +
    theme_classic() +
    theme(legend.position = "none",
          strip.background = element_blank(),
          panel.border = element_rect(color = "black", fill = NA),
          aspect.ratio = 1)
 season2017_plot_list[[length(season2017_plot_list) + 1]] <- p
}

print(season2017_plot_list)
[[1]]


[[2]]


[[3]]


[[4]]


[[5]]


[[6]]


[[7]]


[[8]]


[[9]]

In some divisions, flu activity was very similar between states. For example, New York and Pennsylvania had similar pattern of flu activities, peaking around week 5. East North Central states also had relatively homogeneous flu activity.

In contrast, states in the Mountain/West region had pretty distinct patterns. Nevada was one of the first states to start the 2017-18 flu season but flu activity did not start until later in the year in Utah. Interestingly, Idaho experienced two waves of influenza activity.

# Percent positive cases in 2017-18 season in Mountain
season2017 |>
        filter(division == "Mountain") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = state)) +
        geom_bar(stat = "identity") +
        facet_wrap(~ state) +
        scale_fill_manual(values = c('#855C75', '#D9AF6B', '#AF6458', '#736F4C', '#526A83', '#625377', '#68855C', 'deepskyblue4')) +
        scale_x_discrete(breaks = c(1, 14, 27, 40, 52), labels = c(40, 1, 14, 27, 39)) +
        scale_y_continuous() +
        labs(x = "Week", y = "Percent Positive", title = "2017-18 Mountain States Influenza Activity") +
        theme_classic() +
        theme(legend.position = "none",
              strip.background = element_blank(),
              panel.border = element_rect(color = "black", fill = NA))

# Percent positive cases in 2017-18 season in Idaho
season2017 |>
        filter(state == "Idaho") |>
        ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
        geom_bar(stat = "identity") +
        scale_fill_manual(values = "#AF6458") +
        scale_x_discrete(breaks = c(1, 14, 27, 40, 52), labels = c(40, 1, 14, 27, 39)) +
        scale_y_continuous() +
        labs(x = "Week", y = "Percent Positive", 
             title = "2017-18 Idaho Influenza Positive Tests") +
        theme_classic() +
        theme(legend.position = "none")

GISAID sequencing data

Now let’s take a look at the influenza sequencing data retrieved from GISAID.

# Group by 'subtype' and count the occurrences of each subtype
gisaid |>
  group_by(subtype) |>
  count()
# A tibble: 3 × 2
# Groups:   subtype [3]
  subtype      n
  <chr>    <int>
1 A / H1N1  5850
2 A / H3N2 10369
3 B         2206
# Create a table summary by 'state'
gisaid |> 
  select(-c(id, name, collection_date, week, clade, ordered_week, season)) |> 
  tbl_summary(by = state)
Characteristic Alabama, N = 1381 Alaska, N = 4981 Arizona, N = 3711 Arkansas, N = 1241 California, N = 7821 Colorado, N = 2911 Connecticut, N = 2331 Delaware, N = 2781 Florida, N = 7611 Georgia, N = 2041 Hawaii, N = 4871 Idaho, N = 1901 Illinois, N = 2791 Indiana, N = 2011 Iowa, N = 3261 Kansas, N = 1071 Kentucky, N = 1661 Louisiana, N = 2671 Maine, N = 1611 Maryland, N = 3111 Massachusetts, N = 1661 Michigan, N = 1,2751 Minnesota, N = 2561 Mississippi, N = 1791 Missouri, N = 1701 Montana, N = 2891 Nebraska, N = 1111 Nevada, N = 1591 New Hampshire, N = 1481 New Jersey, N = 2271 New Mexico, N = 1971 New York, N = 5561 North Carolina, N = 2401 North Dakota, N = 1861 Ohio, N = 3191 Oklahoma, N = 1631 Oregon, N = 1451 Pennsylvania, N = 1,3011 Rhode Island, N = 891 South Carolina, N = 1401 South Dakota, N = 2541 Tennessee, N = 2681 Texas, N = 1,3501 Utah, N = 2911 Vermont, N = 1231 Virginia, N = 2851 Washington, N = 1,5031 West Virginia, N = 1401 Wisconsin, N = 1,5951 Wyoming, N = 1251
subtype

















































    A / H1N1 48 (35%) 84 (17%) 127 (34%) 42 (34%) 315 (40%) 109 (37%) 70 (30%) 94 (34%) 222 (29%) 75 (37%) 148 (30%) 59 (31%) 94 (34%) 73 (36%) 89 (27%) 46 (43%) 60 (36%) 58 (22%) 52 (32%) 102 (33%) 66 (40%) 342 (27%) 92 (36%) 47 (26%) 54 (32%) 103 (36%) 38 (34%) 70 (44%) 43 (29%) 72 (32%) 94 (48%) 163 (29%) 75 (31%) 74 (40%) 114 (36%) 61 (37%) 51 (35%) 467 (36%) 43 (48%) 35 (25%) 74 (29%) 83 (31%) 359 (27%) 102 (35%) 36 (29%) 116 (41%) 465 (31%) 41 (29%) 457 (29%) 46 (37%)
    A / H3N2 79 (57%) 388 (78%) 162 (44%) 70 (56%) 374 (48%) 146 (50%) 95 (41%) 125 (45%) 363 (48%) 113 (55%) 233 (48%) 102 (54%) 131 (47%) 98 (49%) 187 (57%) 52 (49%) 71 (43%) 115 (43%) 78 (48%) 139 (45%) 75 (45%) 847 (66%) 113 (44%) 113 (63%) 99 (58%) 145 (50%) 61 (55%) 75 (47%) 79 (53%) 122 (54%) 87 (44%) 336 (60%) 129 (54%) 87 (47%) 184 (58%) 91 (56%) 77 (53%) 760 (58%) 37 (42%) 80 (57%) 127 (50%) 167 (62%) 884 (65%) 127 (44%) 66 (54%) 111 (39%) 982 (65%) 76 (54%) 1,034 (65%) 77 (62%)
    B 11 (8.0%) 26 (5.2%) 82 (22%) 12 (9.7%) 93 (12%) 36 (12%) 68 (29%) 59 (21%) 176 (23%) 16 (7.8%) 106 (22%) 29 (15%) 54 (19%) 30 (15%) 50 (15%) 9 (8.4%) 35 (21%) 94 (35%) 31 (19%) 70 (23%) 25 (15%) 86 (6.7%) 51 (20%) 19 (11%) 17 (10%) 41 (14%) 12 (11%) 14 (8.8%) 26 (18%) 33 (15%) 16 (8.1%) 57 (10%) 36 (15%) 25 (13%) 21 (6.6%) 11 (6.7%) 17 (12%) 74 (5.7%) 9 (10%) 25 (18%) 53 (21%) 18 (6.7%) 107 (7.9%) 62 (21%) 21 (17%) 58 (20%) 56 (3.7%) 23 (16%) 104 (6.5%) 2 (1.6%)
1 n (%)
# Create a table summary by 'season'
gisaid |> 
  select(-c(id, name, collection_date, week, clade, ordered_week, state)) |> 
  tbl_summary(by = season)
Characteristic 2010, N = 131 2012, N = 51 2013, N = 111 2014, N = 6131 2015, N = 2,5351 2016, N = 4,0881 2017, N = 4,4631 2018, N = 6,6841 2019, N = 131
subtype








    A / H1N1 2 (15%) 3 (60%) 1 (9.1%) 9 (1.5%) 1,135 (45%) 403 (9.9%) 1,268 (28%) 3,027 (45%) 2 (15%)
    A / H3N2 11 (85%) 2 (40%) 10 (91%) 594 (97%) 866 (34%) 2,876 (70%) 2,854 (64%) 3,148 (47%) 8 (62%)
    B 0 (0%) 0 (0%) 0 (0%) 10 (1.6%) 534 (21%) 809 (20%) 341 (7.6%) 509 (7.6%) 3 (23%)
1 n (%)

From the summaries, we can see that we have more influenza A sequences than influenza B sequences in the United States between 2010 to 2019, which is consistent with influenza circulation pattern since influenza A is more common overall. Similar to the issue we encountered with the surveillance data, we will also have varied numbers of isolate sequences from state to state depending on the number of laboratories and resources allocated for influenza genome sequencing. My selection criteria might have also introduced bias in sampling. Finally, if we break it down by seasons, we can see that we have more sequencing data from 2015 to 2018, so I will focus my analysis on those flu seasons.

Duration of flu season and clade distribution

I want to start by determining which states had the longest/shortest flu season for every season between 2015 and 2019. To calculate this, I am counting the number of weeks where more than 10% of specimens tested are positive for influenza virus for each state.

# Define the season of interest
seasons <- 2015:2018

# Extract relevant columns from clinical_labs and convert from sf to dataframe
clinical_labs_df <- clinical_labs |>
                      select(week, state, year, season,
                             percent_positive, ordered_week) |>
                      st_drop_geometry() # drops geometry and reclasses df
  

# Initialize an empty data frame to store flu season duration results
flu_duration <- data.frame()  

# Iterate over each season to find states with maximum and minimum flu season durations
for (ii in seasons) {
  
  # Filter and summarize data for each state in the current season
  season_data <- clinical_labs_df |>
                  filter(season == ii) |>
                  group_by(state) |>
                  summarize(flu_season_duration = sum(percent_positive > 10, 
                                                      na.rm = TRUE))
  
  # Find the state with the maximum flu season duration
  max_duration_state <- season_data |>
                          filter(flu_season_duration == 
                                 max(flu_season_duration, na.rm = TRUE)) |>
                          select(state)
  
  # Find the state with the minimum flu season duration
  min_duration_state <- season_data |>
                          filter(flu_season_duration == 
                                   min(flu_season_duration, na.rm = TRUE)) |>
                          select(state)

  # Append results to the flu_duration data frame
  flu_duration <- bind_rows(flu_duration, 
                            data.frame(Season = ii, 
                                       Max_State = max_duration_state$state, 
                                       Min_State = min_duration_state$state))
}

flu_duration
   Season Max_State     Min_State
1    2015    Oregon       Florida
2    2015    Oregon      Missouri
3    2015    Oregon    New Jersey
4    2015    Oregon  Rhode Island
5    2016     Maine        Alaska
6    2016     Maine       Florida
7    2016     Maine    New Jersey
8    2016     Maine  Rhode Island
9    2016     Maine       Wyoming
10   2017    Oregon       Florida
11   2017    Oregon New Hampshire
12   2017    Oregon    New Jersey
13   2017    Oregon  Rhode Island
14   2018   Vermont       Florida
15   2018   Vermont New Hampshire
16   2018   Vermont    New Jersey
17   2018   Vermont  Rhode Island
18   2018   Vermont       Wyoming

However, due to unequal sample sizes between states in each season, I want to make sure the issue of data incompleteness is not biasing our calculation above in finding the state with the longest and shortest duration of flu activity. Therefore, I am also calculating the ratio of weeks with greater than 10% positive influenza rate to the total number of weeks with non-missing values in order to account for states having different numbers of missing values. A state must also have data points for more than 42 weeks out of the 52 weeks in a season to be considered in the ranking.

# Initialize an empty data frame to store flu season duration ratio results
flu_duration_ratio <- data.frame()

# Iterate over each season to find states with maximum and minimum flu season duration ratios
for (ii in seasons) {
  
  # Filter and summarize data for each state in the current season
  season_data <- clinical_labs_df |>
                  filter(season == ii) |>
                  group_by(state) |>
                  summarize(
                    total_weeks = sum(!is.na(percent_positive)),
                    flu_season_duration = sum(percent_positive > 10, na.rm = TRUE)
                    ) |>
                  mutate(ratio = flu_season_duration / total_weeks)
  
  # Find the state with the maximum flu season duration ratio
  max_ratio_state <- season_data |>
                      filter(total_weeks > 42) |>
                      filter(ratio == max(ratio, na.rm = TRUE)) |>
                      select(state)
  
  # Find the state with the minimum flu season duration ratio
  min_ratio_state <- season_data |>
                      filter(total_weeks > 42) |>
                      filter(ratio == min(ratio, na.rm = TRUE)) |>
                      select(state)
  
  # Append results to the flu_duration_ratio data frame
  flu_duration_ratio <- bind_rows(flu_duration_ratio, 
                                  data.frame(Season = ii, 
                                             Max_State = max_ratio_state$state, 
                                             Min_State = min_ratio_state$state))
}

flu_duration_ratio
  Season Max_State    Min_State
1   2015    Oregon     Missouri
2   2016   Vermont   California
3   2016   Vermont     Kentucky
4   2017    Oregon South Dakota
5   2018   Vermont South Dakota

We can see that the two tables generated different lists of states with longest or shortest flu season duration, especially when determining states with shortest duration. I will use the table calculated with ratio to continue my analysis since it has taken into account the issue with missing values at varying levels.

Now we are ready to look at the clade distribution in states that have the longest or the shortest flu seasons. First, I will write a function that will graph a stacked line chart given the season, state, and influenza subtype (H1N1, H3N2, or influenza B) that will show us the clade distribution. Because I don’t have complete sequencing data for every week, I am graphing clade distribution by month. In addition, if there is missing data for a month, I would use the data from the previous month if available.

# Create a function to make stacked line chart 
get_plot <- function(gisaid, plot_season, plot_state, plot_subtype){
  
  # Filter gisaid df
  data1 <- gisaid |>
            filter(season %in% plot_season) |>
            filter(state %in% plot_state) |>
            filter(subtype %in% plot_subtype)
  
  # Create plot titles 
  state_count <- length(plot_state)

  plot_title <- paste0("Influenza clade distribution in ", 
                       ifelse(state_count == 50, "all 50 states", 
                              paste(plot_state, collapse = ", ")),
                       ', Subtype: ', 
                       plot_subtype,' \nSeason: ', 
                       as.character(paste(plot_season, collapse = " ")))
  
  # Make a column that is the rounded date to the month
  data1$date <- as.Date(format(data1$collection_date, "%Y-%m-01"))
  
  # Generate dates by week from min to max 
  d_min <- as.Date(paste0(min(as.character(plot_season)), "-10-01"))
  d_max <- as.Date(paste0(max(as.character(plot_season+1)), "-09-30")) # Since the season ends in September of the next year we do +1 and make the day the last day of September
  
  # Generate a sequence of dates for each week
  date_week <- seq(from = d_min, to = d_max, by = "week")
  
  # Round each date to the first day of the month
  date_week <- as.Date(format(date_week, "%Y-%m-01"))
  
  # Generate lists of unique clades
  clades_unique <- unique(data1$clade)
  
  # Make a blank df for every date and clade combination 
  data2 <- expand.grid(date = date_week, clade = clades_unique)
  
  # Add a column for prop_clade
  data2$prop_clade <- 0
  
  # Populate df based on data1 
  for(ii in 1:nrow(data2)){
    
    temp_w <- data2$date[ii] # Retrieve the week for this row
    temp_data <- subset(data1,data1$date == temp_w) # Select all the data from that week
    temp_table <- data.frame(table(temp_data$clade)) # Get a table of the frequency for the clades of that week
    temp_table$proportion <- temp_table$Freq / sum(temp_table$Freq) # Determine the proportion of that clade
    
    # Populate data2 based on the proportion of temp_table
    temp_c <- as.character(data2$clade[ii])  # Determine the clade of the current row
    temp_table2 <- subset(temp_table,temp_table$Var1 == temp_c) # Determine if that clade was found in the data that week
    if(nrow(temp_table2) > 0){ # If the clade was found for this week then save that clade's proportion to the data2 data frame
      data2$prop_clade[ii] <- temp_table2$proportion[1]
    }
  }
  
  # Populate data3 based on data2
  data3 <- data2
  data3$date_clade <- paste0(as.character(data3$date),'_', 
                             as.character(data3$clade))
  temp_date_prev <- d_min 
  
  # Loop through the month and if there is a month with no data, then use the data from the last month
  for(ii in 1:nrow(data3)){
    
    # Part 1: Detect which months have no data.
    temp_date <- data3$date[ii]
    temp_df <- subset(data3,data3$date == temp_date)
    temp_table <- data.frame(table(temp_df$prop_clade))
    # If there is data for this month then we will have more than one row in this data frame (after filtering out 0 for Var1)
    temp_table <- subset(temp_table,temp_table$Var1 != 0)
    
    # Part 2: Populate with last month's data.
    # If there is no data for this month (nrow == 0), then use the previous month's data.
    # Rationale: Determine which dates have no data, then look for the last month's data, make a lookup table which uses this week's data and then we populate data3 with that.
    if(nrow(temp_table) < 1){ # If there is no data
      # Since this week had no reported data, make a lookup table from the last month's data and populate this month's using it
      temp_last <- subset(data3,data3$date == temp_date_prev)
      # Modify the lookup table so that we can assign this month's date instead of last month's date
      temp_last$date <- temp_date
      temp_last$date_clade <- paste0(as.character(temp_last$date),
                                     '_',as.character(temp_last$clade))
      # Assign the data to be found in the table
      lookup_tab <- temp_last$prop_clade
      # Give the data names
      names(lookup_tab) <- temp_last$date_clade
      # Find indices where a match is found in the lookup table
      matching_indices <- match(data3$date_clade, names(lookup_tab))
      
      # Update the values in data3$prop_clade only where a match is found
      data3$prop_clade[!is.na(matching_indices)] <- lookup_tab[matching_indices[!is.na(matching_indices)]]
      
    }
    
    # Remember this as the previous date
    temp_date_prev <- temp_date
  }
  
  breaks_num <- 50
  label_count <- date_week
  
  # Create a gradient from red to violet for the stacked line chart 
  color_gradient <- colorRampPalette(c("red", "orange", "yellow", 
                                       "blue", "violet"))
  # Generate the colors for each clade
  color_clade <- color_gradient(length(clades_unique))
  
  # Specify the breaks for the x-axis labels
  breaks <- seq(min(data3$date), max(data3$date), by = "1 months")
  
  # Format breaks to display as "Month Year"
  formatted_labels <- format(breaks, "%b %Y")
  
  # Plot with custom x-axis labels
  plot <- ggplot(data3, aes(fill = clade, y = prop_clade, x = date, group = clade)) +
    geom_area(size = 0.5, color = "white") +
    labs(fill = "", x = "", y = "Proportion") +
    ggtitle(plot_title) +
    theme(axis.text = element_text(size = 9), 
          axis.text.x = element_text(angle = 45, hjust = 1),
          panel.grid.major = element_blank(), 
          panel.grid.minor = element_blank(),
          panel.background = element_blank(), 
          axis.line = element_line(colour = "black")) +
    guides(fill = guide_legend(title = "Clade")) +
    scale_fill_manual(values = color_clade) +
    scale_x_date(breaks = breaks, labels = formatted_labels)
  
  # Print the plot
  plot
}

I will start by looking at the clade distribution of all sequences collected between the 2015-16 season and the 2018-19 season.

# Define season and states
plot_season <- c(2015:2018)
plot_state <- states_list

# Make stacked line chart for all H1N1 sequences in all 50 states
plot_subtype <- "A / H1N1"
get_plot(gisaid, plot_season, plot_state, plot_subtype)

# Make stacked line chart for all H3N2 sequences in all 50 states
plot_subtype <- "A / H3N2"
get_plot(gisaid, plot_season, plot_state, plot_subtype)

# Make stacked line chart for all IBV sequences in all 50 states
plot_subtype <- "B"
get_plot(gisaid, plot_season, plot_state, plot_subtype)

For all three influenza subtypes, we can see that clade distribution changes over time, usually with new clades replacing existing clades. There are more diversity with H1N1 and H3N2, whereas influenza B viruses are less diverse.

Now let’s focus on the four influenza seasons with the most sequencing data from GISAID starting from 2015 and see if we can determine clade distribution patterns that are associated with duration of flu season. I will first define some functions to make graphs that show influenza percent positive tests and clade distribution of each subtype in states with the longest/shortest flu activity from each season.

# Graph percent positive test for state with longest duration of flu season
max_state_positive <- function(flu_season, max_state) {
  clinical_labs |>
    filter(season == flu_season, state == max_state) |>
    ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
    geom_bar(stat = "identity") +
    geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
    scale_fill_manual(values = "#526A83") +
    scale_x_discrete(breaks = c(1, 14, 27, 40, 52), 
                         labels = c(40, 1, 14, 27, 39)) +
    scale_y_continuous(limits = c(0, 60)) +
    
    labs(x = "Week", y = "Percent Positive", 
         title = paste(flu_season, max_state)) +
    theme_classic() +
    theme(legend.position = "none")
}

# Graph percent positive test for state with shortest duration of flu season
min_state_positive <- function(flu_season, min_state) {
  clinical_labs |>
    filter(season == flu_season, state == min_state) |>
    ggplot(aes(x = ordered_week, y = percent_positive, fill = division)) +
    geom_bar(stat = "identity") +
    geom_hline(yintercept = 10, linetype = "dashed", alpha = 0.6) +
    scale_fill_manual(values = "#68855C") +
    scale_x_discrete(breaks = c(1, 14, 27, 40, 52), 
                         labels = c(40, 1, 14, 27, 39)) +
    scale_y_continuous(limits = c(0, 60)) +
    labs(x = "Week", y = "Percent Positive", 
         title = paste(flu_season, min_state)) +
    theme_classic() +
    theme(legend.position = "none")
}
# Generate plots for all combinations of variables
all_subtype_plots <- function(flu_season, max_min_duration, subtypes){
  
  plots_list <- list()

  for (season in flu_season) {
  for (state in max_min_duration) {
    for (subtype in subtypes) {
      tryCatch({
        # Generate a plot for the current combination
        plot <- get_plot(gisaid, season, state, subtype)
        # Store the plot in the list
        plots_list[[paste(season, state, subtype)]] <- plot
      }, error = function(e) {
        # Print a message when there is an error (e.g., no data for the combination)
        cat("Error:", e$message, "\n")
      })
    }
  }
  }
  return(plots_list)
}

Now let’s look at the clade distribution of H1N1, H3N2, and influenza B viruses in the 2015-16, 2016-17, 2017-18, and 2018-19 season.

# Graph percent positive tests for states with the longest and shortest duration of flu season
max_2015 <- max_state_positive(2015, "Oregon")
min_2015 <- min_state_positive(2015, "Missouri")
plot_grid(max_2015, min_2015)

flu_season <- c(2015)
max_min_duration <- c("Oregon", "Missouri")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Call the function and store the result
plots_list <- all_subtype_plots(flu_season, max_min_duration, subtypes)

# Graph clade distribution of states with longest and shortest duration of flu season
plot_grid(plotlist = plots_list[1:3], ncol = 1)

plot_grid(plotlist = plots_list[4:6], ncol = 1)

It is difficult to compare the clade distribution pattern between Oregon (long flu season) and Missouri (short flu season) in the 2015-16 season as the same clade dominated in both states (6B.1 for H1N1 and V1A for influenza B respectively). This is expected since low clade diversity was observed in all 50 states in the 2015-16 season. For H3N2, there were more clades that circulated during the season but one clade dominated at a time for the most part, which was a pattern shared between Oregon and Missouri.

# Graph percent positive tests for states with the longest and shortest duration of flu season
max_2016 <- max_state_positive(2016, "Vermont")
min_2016_1 <- min_state_positive(2016, "California")
min_2016_2 <- min_state_positive(2016, "Kentucky")
plot_grid(max_2016, min_2016_1, min_2016_2, nrow = 1)

flu_season <- c(2016)
max_min_duration <- c("Vermont", "California", "Kentucky")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Call the function and store the result
plots_list <- all_subtype_plots(flu_season, max_min_duration, subtypes)
Error: replacement has 1 row, data has 0 
# Graph clade distribution of states with longest and shortest duration of flu season
plot_grid(plotlist = plots_list[1:2], ncol = 1)

plot_grid(plotlist = plots_list[3:5], ncol = 1)

plot_grid(plotlist = plots_list[6:8], ncol = 1)

For the 2016-17 season, Vermont has the longest flu season and California and Kentucky had the shortest flu season. We do not have H1N1 sequencing data for Vermont in 2016, so we couldn’t compare the H1N1 pattern. For H3N2, it seems like the number of circulating clades varied between states. For example, in March 2017, 3C.2a1, 3C.2a3, and 3C.3a circulated in Vermont whereas 3C.2a1 and 3C.2a2 circulated in California and 3C.2a1 dominated in Kentucky. However it is worth noting that many of these clades are relatively similar. There were also more H3N2 clade diversity in Vermont and California compared to Kentucky. For influenza B, though different clades dominated in Vermont vs California and Kentucky, only one clade dominated at at time.

# Graph percent positive tests for states with the longest and shortest duration of flu season
max_2017 <- max_state_positive(2017, "Oregon")
min_2017 <- min_state_positive(2017, "South Dakota")
plot_grid(max_2017, min_2017)

flu_season <- c(2017)
max_min_duration <- c("Oregon", "South Dakota")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Call the function and store the result
plots_list <- all_subtype_plots(flu_season, max_min_duration, subtypes)

# Graph clade distribution of states with longest and shortest duration of flu season
plot_grid(plotlist = plots_list[1:3], ncol = 1)

plot_grid(plotlist = plots_list[4:6], ncol = 1)

The pattern observed in the 2017-18 season was similar to that of 2015-16. Both Oregon (long flu season) and South Dakota (short flu season) had relatively less diversity and mostly one clade dominated at a time for each subtype.

# Graph percent positive tests for states with the longest and shortest duration of flu season
max_2018 <- max_state_positive(2018, "Vermont")
min_2018 <- min_state_positive(2018, "South Dakota")
plot_grid(max_2018, min_2018)

flu_season <- c(2018)
max_min_duration <- c("Vermont", "South Dakota")
subtypes <- c("A / H1N1", "A / H3N2", "B")

# Call the function and store the result
plots_list <- all_subtype_plots(flu_season, max_min_duration, subtypes)

# Graph clade distribution of states with longest and shortest duration of flu season
plot_grid(plotlist = plots_list[1:3], ncol = 1)

plot_grid(plotlist = plots_list[4:6], ncol = 1)

In the 2018-19 season, the distribution is relatively similar for H3N2 and influenza B between Vermont (long flu season) and South Dakota (short flu season). For H1N1, there were many different H1N1 clades present between October and April in South Dakota whereas Vermont experienced less diversity.

Conclusion

Influenza continues to be a major public health burden affecting millions of people annually. Understanding patterns of influenza activity can help guide influenza prevention and control programs. Using influenza epidemiologic data from CDC and sequencing data from GISAID, I was able to perform some exploratory analysis to examine influenza activity in the United States.

It is clear that influenza activities vary in different parts of the country and from season to season. When focusing on the four influenza seasons with the most sequencing data to examine influenza clade distribution, it was difficult to identify patterns associated with influenza season duration. Clade distribution varied greatly across states and over time. The clades that were present in each state were also different, making it difficult to interpret if the differences in flu duration were due to the number of clades that were circulating or the presence of a specific clade in the state.

However, there are several limitations to this study. One of the biggest issues is that data completeness varied substantially across states. There are different testing practices by reporting labs so it might not be appropriate to compare the magnitude of positivity rates between states or seasons. This also applies to the clade information from the GISAID sequencing metadata. I had to use imputation to replace missing data and therefore the clade distribution probably does not reflect the actual frequency of circulating strains. There are also probably better ways to do the imputation and will be worth exploring for future direction.